home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
TeX 1995 July
/
TeX CD-ROM July 1995 (Disc 1)(Walnut Creek)(1995).ISO
/
biblio
/
bibtex
/
utils
/
bibindex
/
bibindex.c
< prev
next >
Wrap
C/C++ Source or Header
|
1993-05-31
|
33KB
|
1,063 lines
/* ================================================================= *\
bibindex -- a program to index bibtex files, used in conjunction
with biblook
This program was specifically developed for use with the
computational geometry bibliographic database, and assumes the
rules thereof. The database may be obtained by anonymous ftp
from cs.usask.ca in the file "pub/geometry/geombib.tar.Z".
Version 1.0 written by Jeff Erickson <jeff@ics.uci.edu>, 27 Mar 92
Version 2.0 written by Jeff Erickson <jeff@ics.uci.edu>, 17 Jun 92
This program is in the public domain. You may use it or modify
it to your heart's content, at your own risk. Bouquets, brickbats,
and bug fixes may be sent to Jeff Erickson, jeffe@cs.berkeley.edu.
%Make% gcc -O -o bibindex bibindex.c
Usage: bibindex bibfile [-i field ...]
-----------------------------------------------------------------
HOW IT WORKS:
The bibtex file is read field by field. The file offset beginning
each record and each record's citation key are recorded. A list of
words is extracted from each field. These words are placed into
tables, which remember which records contain them in their
respective fields. Once the file has been completely read, the
hash tables are compacted and sorted.
The hash tables are extensible, since we have to maintain one for
each possible field type, and static tables would be way too big.
Initially, each table holds 1K entries, and the tables are doubled
whenever they reach 75% capacity. Each table entry is at least 24
bytes. If the resulting hash tables use too much memory, the
entries should be changed to pointers, allocated on the fly.
The entry lists associated with each word are implemented as
extensible arrays. Initially, each list holds eight entries. If a
new entry is inserted into a full list, the list is doubled first.
The index file has the following format (loosely):
version info
# entries
array of offsets into bib file -- one per entry
# field types
array of field names -- one per field type
array of -- one per field type
# words
array of -- one per word
word -- in alphabetical order
# locations
array of entry #s -- one per location
There are advantages and disadvantages of having multiple hash
tables instead of a single table. I am starting with the premise
that the lookup program should be very fast. Consequently, I can't
make it determine which fields contain a given word. Doing so
would require putting ALL of the entry-parsing code into the lookup
program. It would also mean potentially parsing a lot of
extraneous entries to find relatively common words in relatively
uncommon places (eg, "title edelsbrunner").
If there were a single word table, each word list would have to
include bitmasks to indicate the appropriate fields along with the
entry numbers. Assuming between 16 and 32 field types (the CG bib
uses about 24), this would triple the size of each entry. On the
average, each word occurs in less than two field types. The
bitmask approach would also require knowledge of the field names in
advance; the multiple table approach does not.
-----------------------------------------------------------------
VERSION HISTORY:
1.0 <jge> 3/26/92 Initial version completed
1.1 <jge> 3/27/92 Words restricted to letters only; special
rules added for apostrophes, so that words
like "didn't" and "O'Rourke" can be handled
correctly.
1.2 <jge> 3/30/92 Faster hash function; now compressing hash
tables before sorting them. Execution time on
the CG bib reduced by a factor of thirty-five.
1.3 <jge> 4/2/92 Toyed with the hash function some more, trying
to reduce collisions, with limited success.
1.4 <jge> 4/17/92 Added exit(0) at the end of main() [I thought
that was supposed to be automatic!]
2.0 <jge> 6/12/92 First major revision completed.
1. Major change in file format -- word tables for every
field instead of just author, title, and keywords.
2. Word hash tables made extensible.
3. Fixed bug in GetNextWord that would cause the function
to return inside nested braces.
4. Fixed bug in MungeField that would kill everything in an
entry after an abbreviation. Abbreviations now go into
the appropriate hash table with the other words.
5. Made GetNextWord consider numbers and math expressions
to be words.
6. Added WriteWord, resulting in 40% savings in disk space.
7. Added -i flag and black holes. Ignoring "oldlabel"
gives another 15% savings (6/92 version of CGbib).
2.1 <jge> 7/9/92 Minor bug fixes.
2.2 Nelson H. F. Beebe <beebe@math.utah.edu> 03-Oct-1992
Testing with >100K lines of .bib files led to these changes:
1. Add support for complete BibTeX keyword name syntax with
iskeychar() function.
2. Add support for trailing comma after last key = "value" entry.
3. Record input line number for display in error messages.
4. Record initial line number of each BibTeX entry for
display in error messages to better localize error.
5. Add test for word buffer overflow in MungeField() to prevent
run-time core dumps, and increase supported word size from
15 to 31 (a word-length histogram of a 116921-word dictionary
found words up to 28 characters long, with 1% longer than 15).
File version increased to 2 because of word size change.
6. Add typecasts in calls to qsort() and comparisons of
unsigned short with short, change main() from void to int,
remove variable from initializer of msg[2], and add void to
IndexBibFile() definition to allow compilation with C++ as
well as C for better compile-time checking.
7. In MungeEntry(), do an ungetc() after key name collection.
Otherwise, a key="value" pair without a blank before the =
will not be recognized because the = read on lookahead has
not been put back.
8. Revise table compression code in OutputTables() and
code in FreeTables() to avoid duplicate frees, which is
a fatal error on many systems, and specified to result
in undefined behavior by the 1989 ANSI/ISO C Standard.
9. Define bcopy() as a macro to invoke standard memcpy()
instead.
10. Include time.h, unistd.h, and malloc.h to get proper
function declarations for library routines needed here.
11. Add DEBUG_MALLOC support.
12. Change several char* types to const char*.
13. Correct some typographical errors in comment.
14. Write UNIX manual pages for bibindex and biblook.
15. Allow command-line to specify a filename with .bib extension.
16. Add help support to biblook.
17. Correct error in FreeTables() in biblook.c; inner loop
incremented i instead of j.
2.3 Bill Jones <jones@cs.usask.ca> 93/01/29
1. Declarations common to bibindex.c and biblook.c factored out
to new file biblook.h.
2. Index type of (signed) short overflows early; created typedef
Index_t, defined as unsigned short.
3. Changed hash tables to extend at 75% capacity rather than 50%.
2.4 Nelson H. F. Beebe <beebe@math.utah.edu> [01-Jun-1993]
1. Remove some mixed-mode arithmetic.
2. Increase MAXFIELDS from 64 to 127 to deal with TeX User Group
bibliography collection
3. Correct error in GetHashTable(); older versions got into an
infinite loop if MAXFIELDS field names were already stored, and
a new one was encountered.
\* ================================================================= */
#include "biblook.h"
static long line_number = 1L; /* for debug messages */
static long initial_line_number = 1L;
/* ======================= UTILITY FUNCTIONS ======================= */
/* ----------------------------------------------------------------- *\
| void die(const char *msg1, const char *msg2) -- print an error message and die
\* ----------------------------------------------------------------- */
void die(const char *msg1, const char *msg2)
{
fprintf(stderr,
"\nError:\tin BibTeX entry starting at line %ld, \
error detected at line %ld:\n\t%s %s\n",
initial_line_number, line_number, msg1, msg2);
exit(1);
}
/* ----------------------------------------------------------------- *\
| char safegetc(FILE *fp, const char *where)
|
| Get the next character safely. Used by routines that assume that
| they won't run into the end of file.
\* ----------------------------------------------------------------- */
char safegetc(FILE *fp, const char *where)
{
char c;
if (feof(fp))
die("Unexpected end of file in", where);
c = getc(fp);
if (c == '\n')
line_number++;
return (c);
}
/* ----------------------------------------------------------------- *\
| void *safemalloc(unsigned howmuch, const char *msg1, const char *msg2)
|
| Allocate memory safely. Used by routines that assume they won't
| run out of memory.
\* ----------------------------------------------------------------- */
void *safemalloc(unsigned howmuch, const char *msg1, const char *msg2)
{
register void *tmp = NULL;
tmp = malloc(howmuch);
if (tmp == NULL)
die(msg1, msg2);
return tmp;
}
/* ====================== HASH TABLE FUNCTIONS ===================== *\
The hash tables start small and double whenever they reach 75%
capacity. Hashing is performed by going through the string one
character at a time, multiplying by a constant and adding in the
new character value each time. The constant is defined to give
the same even spread (about size/sqrt(2)) between successive
multiples, as long as the hash table size is a power of two.
Collisions are resolved by double hashing. Since the hash table
size is always a power of two, the secondary hash value has to be
odd to avoid loops.
The field tables are non-extensible hash tables, otherwise handled
the same way. It is probably well worth the effort to fine tune
the field table hash function in order to avoid collisions.
The field tables associated with ignored fields are black holes.
Everything is the same, except that InsertEntry doesn't actually
DO anything.
\* ================================================================= */
#define MAXFIELDS 127 /* intentionally way too much */
/* NB: limited by char type of numfields */
#define INIT_HASH_SIZE 256
#define HASH_CONST 1482907 /* prime close to 2^{20.5} */
typedef struct /* Hash table entry */
{
Word theword; /* the hashed word */
Index_t number; /* number of references in the list */
Index_t size; /* real size of reference list */
Index_t *refs; /* actual list of references */
} HashCell, *HashPtr;
typedef struct /* Extensible hash table */
{
Word thefield; /* the field type */
Index_t number; /* number of words in the hash table */
Index_t size; /* real size of the hash table */
HashPtr words; /* the actual hash table */
} ExHashTable;
static ExHashTable fieldtable[MAXFIELDS]; /* the field tables */
static char numfields; /* number of fields */
/* ----------------------------------------------------------------- *\
| void InitTables(void)
|
| Initialize the field tables
\* ----------------------------------------------------------------- */
void InitTables(VOID)
{
register unsigned short i;
numfields = 0;
for (i=0; i<MAXFIELDS; i++)
{
fieldtable[i].thefield[0] = 0;
fieldtable[i].number = 0;
fieldtable[i].size = 0;
fieldtable[i].words = NULL;
}
}
/* ----------------------------------------------------------------- *\
| void InitOneField(ExHashTable *htable)
|
| Initialize one field's hash table
\* ----------------------------------------------------------------- */
void InitOneField(register ExHashTable *htable)
{
Index_t i;
htable->number = 0;
htable->size = INIT_HASH_SIZE;
htable->words = (HashPtr) safemalloc(INIT_HASH_SIZE*sizeof(HashCell),
"Can't create hash table for",
htable->thefield);
for (i=0; i<INIT_HASH_SIZE; i++)
{
htable->words[i].theword[0] = 0;
htable->words[i].number = 0;
htable->words[i].size = 0;
htable->words[i].refs = NULL;
}
}
/* ----------------------------------------------------------------- *\
| void FreeTables(void)
|
| Free the tables
\* ----------------------------------------------------------------- */
void FreeTables(VOID)
{
register unsigned short i;
Index_t j;
for (i=0; i < (unsigned short)numfields; i++)
{
if (fieldtable[i].words)
{
for (j=0; j < fieldtable[i].number; j++)
{
if (fieldtable[i].words[j].refs)
free(fieldtable[i].words[j].refs);
}
free(fieldtable[i].words);
}
}
}
/* ----------------------------------------------------------------- *\
| ExHashTable *GetHashTable(char *field)
|
| Get the hash table associated with the given field. If the table
| is unclaimed, claim it and initialize it.
\* ----------------------------------------------------------------- */
ExHashTable *GetHashTable(char *field)
{
register unsigned long hash=0; /* primary hash value */
register unsigned long skip=1; /* secondary hash value */
register int i;
for (i=0; field[i]; i++)
{
hash = (hash*HASH_CONST + field[i]) % MAXFIELDS;
skip += 2*hash;
}
while (fieldtable[hash].thefield[0] && /* cell not empty and */
strcmp(fieldtable[hash].thefield, field)) /* not the right field */
{
hash = (hash+skip) % MAXFIELDS;
}
if (!fieldtable[hash].thefield[0])
{
strcpy(fieldtable[hash].thefield, field);
InitOneField(fieldtable+hash);
numfields++;
if (numfields >= MAXFIELDS) /* NB: NOT > because that produces */
/* an infinite while() loop above on */
/* next entry! */
die("too many field names",field);
}
return fieldtable+hash;
}
/* ----------------------------------------------------------------- *\
| void InitBlackHole(char *field)
|
| Intitialize a black hole for the given field.
\* ----------------------------------------------------------------- */
void InitBlackHole(char *field)
{
ExHashTable *hole;
hole = GetHashTable(field);
free(hole->words);
hole->words = NULL;
}
/* ----------------------------------------------------------------- *\
| HashPtr GetHashCell(ExHashTable *table, char *word)
|
| Get the hash table cell associated with the given word. If the cell
| is unclaimed, claim it, initialize it, and update the table's word
| count.
\* ----------------------------------------------------------------- */
HashPtr GetHashCell(ExHashTable *htable, char *word)
{
register HashPtr table, cell;
register unsigned long hash=0; /* primary hash value */
register unsigned long skip=1; /* secondary hash value */
register int i;
table = htable->words;
for (i=0; word[i]; i++)
{
hash = (hash*HASH_CONST + word[i]) % htable->size;
skip += 2*hash;
}
while (table[hash].theword[0] && /* cell not empty and */
strcmp(table[hash].theword, word)) /* not the right word */
{
hash = (hash+skip) % htable->size;
}
cell = table+hash;
if (!cell->theword[0])
{
strcpy(cell->theword, word);
cell->size = 8;
cell->refs = (Index_t *) safemalloc(cell->size * sizeof(Index_t),
"Can't create entry list for",
word);
htable->number++;
}
return cell;
}
/* ----------------------------------------------------------------- *\
| void ExtendHashTable(ExHashTable *htable)
|
| Double the size of the hash table and rehash everything.
\* ----------------------------------------------------------------- */
void ExtendHashTable(ExHashTable *htable)
{
register HashPtr newcell;
register HashPtr oldtable;
Index_t i;
Index_t oldsize;
oldsize = htable->size;
oldtable = htable->words;
htable->number = 0;
htable->size *= 2;
htable->words = (HashPtr) safemalloc(sizeof(HashCell)*htable->size,
"Can't extend hash table for",
htable->thefield);
for (i=0; i < htable->size; i++)
{
htable->words[i].theword[0] = 0;
htable->words[i].number = 0;
htable->words[i].size = 0;
htable->words[i].refs = NULL;
}
for (i=0; i< oldsize; i++)
{
if (oldtable[i].theword[0])
{
newcell = GetHashCell(htable, oldtable[i].theword);
*newcell = oldtable[i];
}
}
free(oldtable);
}
/* ----------------------------------------------------------------- *\
| void InsertEntry(ExHashTable *htable, char *word, Index_t entry)
|
| Insert the word/entry pair into the hash table, unless it's
| already there.
\* ----------------------------------------------------------------- */
void InsertEntry(ExHashTable *htable, char *word, Index_t entry)
{
register HashPtr cell;
Index_t *newlist;
if (htable->words == NULL) /* Is it a black hole? */
return;
if ((Index_t)(htable->number * 4) > (Index_t)(htable->size * 3))
ExtendHashTable(htable);
cell = GetHashCell(htable, word);
if (cell->number && (cell->refs[cell->number-1] == entry))
return;
if (cell->number == cell->size) /* expand the array */
{
cell->size *= 2;
newlist = (Index_t *) safemalloc(cell->size * sizeof(Index_t),
"Can't extend entry list for",
word);
bcopy(cell->refs, newlist, cell->number * sizeof(Index_t));
free(cell->refs);
cell->refs = newlist;
}
cell->refs[cell->number++] = entry;
if (cell->number <= 0)
die("hash type overflow", htable->thefield);
}
/* ================================================================= */
/* ----------------------------------------------------------------- *\
| void WriteWord(FILE *ofp, Word word)
|
| Output the word in "Pascal" string format -- length byte followed
| by characters. This saves some disk space over writing MAXWORD+1 bytes
| in all cases.
\* ----------------------------------------------------------------- */
void WriteWord(FILE *ofp, Word word)
{
unsigned char length = strlen(word);
fwrite((void *) &length, sizeof(char), 1, ofp);
fwrite((void *) word, sizeof(char), length, ofp);
}
/* ----------------------------------------------------------------- *\
| void OutputTables(FILE *ofp)
|
| Compress and output the tables, with lots of user feedback.
| KLUDGE -- Passing strcmp to qsort assumes intimate knowledge of
| the HashCell and ExhashTable structs.
\* ----------------------------------------------------------------- */
void OutputTables(FILE *ofp)
{
register HashPtr words;
register int i, k, count;
Index_t m, n;
printf("Writing index tables...");
fflush(stdout);
numfields = 0; /* recount, ignoring black holes */
for (i=0; i<MAXFIELDS; i++)
{
if (fieldtable[i].words)
{
if (i > numfields)
{
fieldtable[numfields] = fieldtable[i]; /* copy i-th table */
fieldtable[i].number = 0; /* then clear i-th table */
fieldtable[i].size = 0; /* to avoid duplicate free() later */
fieldtable[i].words = NULL;
}
numfields++;
}
}
qsort(fieldtable, (size_t)numfields, sizeof(ExHashTable),
(int (*)(const void*,const void*))strcmp);
fwrite((void *) &numfields, sizeof(char), 1, ofp);
for (i=0; i<numfields; i++)
WriteWord(ofp, fieldtable[i].thefield);
printf("[%d fields]\n", numfields);
for (k=0; k<numfields; k++)
{
printf("%2d: %s...", k+1, fieldtable[k].thefield);
fflush(stdout);
words = fieldtable[k].words;
for (m=0, n=0; m<fieldtable[k].size; m++)
{
if (words[m].theword[0])
{
if (m > n)
{
words[n] = words[m]; /* copy m-th table to n-th */
words[m].number = 0; /* then clear m-th table */
words[m].size = 0; /* to avoid duplicate free() later */
words[m].refs = (Index_t*)0;
}
n++;
}
}
qsort(words, (size_t)fieldtable[k].number, sizeof(HashCell),
(int (*)(const void*,const void*))strcmp);
fwrite((void *) &(fieldtable[k].number), sizeof(Index_t), 1, ofp);
count = 0;
for (m=0; m<fieldtable[k].number; m++)
{
WriteWord(ofp, words[m].theword);
fwrite((void *) &(words[m].number), sizeof(Index_t), 1, ofp);
fwrite((void *) words[m].refs, sizeof(Index_t),
words[m].number, ofp);
count += words[m].number;
}
printf("[%d words, %d references]\n", fieldtable[k].number, count);
}
}
/* ============================= INPUT ============================= *\
I'd like this to work with more than just the CG bib, so I can't
assume very much about the input. In particular, all white space
(blanks, tabs, and newlines) is identical most of the time. On the
other hand, it would be nice to include any "comments" that
obviously go with an entry as part of that entry. Consequently, two
newlines in a row (possibly separated by other white space) is
considered a break between entries. This will give us bogus entries
for stray "comments", but we can take care that easily enough -- an
entry is only real if it contains a @ character.
\* ================================================================= */
/* ----------------------------------------------------------------- *\
| unsigned long FindNextEntry(FILE *ifp)
|
| Return the file offset to the next entry in the bib file. On exit,
| the file pointer is left at the first comma. Entries containing no
| commas, such as @string, are skipped. The entry begins after the
| most recent blank line, the end of the last entry, or the beginning
| of the file. Returns "-1" if there is no next entry.
\* ----------------------------------------------------------------- */
unsigned long FindNextEntry(FILE *ifp)
{
char ch;
char blank = 0; /* 1 if current line is blank so far */
char braces = 0;
char quotes = 0;
unsigned long offset;
offset = ftell(ifp);
ch = getc(ifp);
if (ch == '\n')
line_number++;
while (1)
{
if (feof(ifp))
return (unsigned long) -1;
if (ch == '@') /* got a potential entry */
{
initial_line_number = line_number; /* record start of entry */
blank = 0;
while ((ch != '{') && (ch != '('))
ch = safegetc(ifp, "FindNextEntry 1"); /* EOF is an */
/* error here */
quotes = 0;
braces = 0;
ch = safegetc(ifp, "FindNextEntry 2");
while (quotes || braces ||
((ch != ',') && (ch != '}') && (ch != ')')))
{
if (ch == '{')
braces++;
else if (ch == '}')
braces--;
else if ((ch == '"') && !braces)
quotes = !quotes;
ch = safegetc(ifp, "FindNextEntry 3");
}
if (ch == ',')
{
ungetc(ch, ifp);
return offset;
}
else
offset = ftell(ifp);
}
else if (ch == '\n')
{
if (blank)
offset = ftell(ifp);
blank = 1;
}
else if (!isspace(ch))
blank = 0;
ch = getc(ifp);
if (ch == '\n')
line_number++;
}
}
/* ----------------------------------------------------------------- *\
| int GetNextWord(FILE *ifp, char *word)
|
| Get the next word in the current field. A word is any contiguous
| set of letters and numbers, AFTER the following steps:
| 1. Letters are folded to lower case. Thus, "Voronoi" is
| returned as "voronoi"
| 2. All TeX commands, except those in math expressions, are
| removed, but their arguments are left behind. Thus,
| "Erd{\H o}ss" is returned as "erdos".
| 3. All other non-word characters are removed. Non-word
| characters inside {{possibly} nested} braces or dollar
| signs do not delimit words, so they may cause unexpected
| results. Thus, "{this example}" is returned as
| "thisexample".
| 4. TeX commands in math expressions are considered normal
| text. Thus, "$O(n\log^2 n)$" is returned as "onlog2n"
| instead of "onn". This occasionally gives unexpected or
| unreadable results. For example, "$\bigcup_1^n[a_i,b_i]$"
| is returned as "bigcup1naibi".
| 5. Apostrophes do not delimit words. Thus, "didn't" is
| returned as "didnt", and "{\'O}'D{\'u}nlaing" is returned
| as "odunlaing".
|
| Returns 1 if a word was actually found, 0 otherwise.
|
| The file pointer is left at the character immediately following the
| word. The input is assumed to be syntactically correct: unbalanced
| braces, math delimiters, and quotation marks will cause errors.
\* ----------------------------------------------------------------- */
int GetNextWord(FILE *ifp, char *word)
{
char braces = 0; /* levels of indented braces */
char math = 0; /* 1 if in math expression */
char inword = 0; /* 1 if word has started */
char incmd = 0; /* 1 if reading TeX command */
char done = 0; /* 1 if word is complete */
char ch = ' ';
int nchars = 0; /* how many characters stored in word */
while (!done)
{
ch = safegetc(ifp, "GetNextWord 1");
if (isalpha(ch)) /* letters */
{
if (!incmd) /* may be part of TeX command */
{
inword = 1;
if (++nchars <= MAXWORD) /* ignore overflow in word buffer */
*word++ = tolower(ch);
}
}
else if (isdigit(ch))
{
incmd = 0;
inword = 1;
if (++nchars <= MAXWORD) /* ignore overflow in word buffer */
*word++ = ch;
}
else if (ch == '\'') /* ignore apostrophes entirely */
incmd = 0;
else if (ch == '\\') /* beginning of TeX command */
{
if (!math) /* if math, pretend it's text */
{
ch = safegetc(ifp, "GetNextWord 2");
if (isalpha(ch))
incmd = 1;
}
}
else if (ch == '{') /* left brace */
{
incmd = 0;
braces++;
}
else if (ch == '}') /* right brace */
{
incmd = 0;
if (!braces)
{
ungetc(ch, ifp);
done = 1;
}
else
braces--;
}
else if (ch == '"') /* double quote */
{
incmd = 0;
if (!braces)
{
ungetc(ch, ifp);
done = 1;
}
}
else if (ch == '$') /* math */
{
incmd = 0;
math = !math;
if (math) /* treat $s as braces */
braces++;
else
braces--;
}
else if (incmd) /* other characters */
incmd = 0;
else if (inword && !braces)
done = 1;
}
*word = 0;
return inword;
}
const char *badwords[] = BADWORDS;
/* ----------------------------------------------------------------- *\
| void MungeField(FILE *ifp, ExHashTable *htable, Index_t entry)
|
| Munge the current field. Dump all words into the given hash table,
| with the given entry number. On exit, the file pointer is on the
| comma or closing character for the entry.
\* ----------------------------------------------------------------- */
void MungeField(FILE *ifp, ExHashTable *htable, Index_t entry)
{
int i;
char ch, bad;
char nextword[MAXWORD+1];
ch = safegetc(ifp, "MungeField");
while (isspace(ch))
ch = safegetc(ifp, "MungeField");
if (ch != '=')
die("= expected in bibliography field", htable->thefield);
ch = safegetc(ifp, "MungeField");
while (isspace(ch))
ch = safegetc(ifp, "MungeField");
if (ch != '{' && ch != '"') /* handle abbreviations */
{
for (i=0; !strchr(NONABBREV, ch); i++)
{
if (i >= MAXWORD)
die("word buffer overflow", htable->thefield);
nextword[i] = tolower(ch);
ch = safegetc(ifp, "MungeField");
}
nextword[i] = 0;
nextword[MAXWORD] = 0;
InsertEntry(htable, nextword, entry);
}
else
{
while (GetNextWord(ifp, nextword))
{
if (nextword[1] == 0) /* ignore single letters */
continue;
for (bad=0, i=0; !bad && badwords[i]; i++)
if (!strcmp(nextword, badwords[i]))
bad = 1;
if (!bad)
{
nextword[MAXWORD] = 0; /* truncate to fit */
InsertEntry(htable, nextword, entry);
}
}
ch = safegetc(ifp, "MungeField"); /* close quote/brace */
ch = safegetc(ifp, "MungeField"); /* comma (maybe) */
}
while (isspace(ch))
ch = safegetc(ifp, "MungeField");
ungetc(ch, ifp); /* comma or close paren/brace */
}
/* ----------------------------------------------------------------- *\
| int iskeychar(char c,int first_char)
|
| Return 1 if c is a valid keyword character, and 0 otherwise.
| The rules are different for the first character, so first_char
| must be set non-zero to select that case.
|
| The code in bibtex.web in the section on character set dependencies
| creates a character classification array, id_class[], which permits
| the following characters (verified by experiment too) in field
| names:
|
| A-Z a-z 0-9
| ! $ & + - . / : ; < > ? @ [ \ ] ^ _ ` | ~ <DELete>
|
\* ----------------------------------------------------------------- */
int iskeychar(char c,int first_char)
{
if (first_char)
return (isalpha(c) ? 1 : 0);
else if (isalpha(c) || isdigit(c))
return (1);
else if (strchr("!$&+-./:;<>?@[\\]^_`|~\177",(int)c) != (char*)NULL)
return (1);
else
return (0);
}
/* ----------------------------------------------------------------- *\
| void MungeEntry(FILE *ifp, Index_t entry)
|
| Wander through the entry, extracting words and depositing them in
| the appropriate hash tables.
\* ----------------------------------------------------------------- */
void MungeEntry(FILE *ifp, Index_t entry)
{
register char ch;
Word thefield;
short i;
ch = safegetc(ifp, "MungeEntry");
while (ch == ',')
{
ch = safegetc(ifp, "MungeEntry");
while (isspace(ch))
ch = safegetc(ifp, "MungeEntry");
if ((ch == '}') || (ch == ')')) /* allow trailing comma after */
return; /* last key = "value" entry */
if (!isalpha(ch))
{
char msg[2];
msg[0] = ch;
msg[1] = 0;
die("Nonalpha character in field descriptor:", msg);
}
for (i=0; iskeychar(ch,i==0); i++)
{
if (i >= MAXWORD)
{
thefield[i] = 0;
die("field name buffer overflow", thefield);
}
thefield[i] = tolower(ch);
ch = safegetc(ifp, "MungeEntry");
}
ungetc(ch, ifp); /* put back lookahead char */
thefield[i] = 0;
MungeField(ifp, GetHashTable(thefield), entry);
ch = safegetc(ifp, "MungeEntry");
}
}
/* ============================= OUTPUT ============================ */
/* ----------------------------------------------------------------- *\
| IndexBibFile(FILE *ifp, FILE *ofp, char *filename)
|
| Index a bibtex file. Input comes from ifp; output goes to ofp.
| Filename is the name of the bibliography, with no prefix.
\* ----------------------------------------------------------------- */
void
IndexBibFile(FILE *ifp, FILE *ofp, char *filename)
{
Index_t count=0;
long curoffset;
long *offsets;
long *oldoff;
Index_t offsize;
time_t now = time(0);
printf("Indexing %s.bib...", filename);
fflush(stdout);
offsize = 128; /* MINIMUM OFFSET LIST SIZE */
offsets = (long *) malloc(offsize * sizeof(long));
while (!feof(ifp))
{
curoffset = FindNextEntry(ifp);
if (curoffset == (unsigned long) -1)
break;
offsets[count] = curoffset;
MungeEntry(ifp, count++);
if (count == offsize) /* expand full offset array */
{
oldoff = offsets;
offsize *= 2;
offsets = (long *) malloc(offsize * sizeof(long));
bcopy(oldoff, offsets, count * sizeof(long));
free(oldoff);
}
if (!(count % 500)) /* feedback */
{
printf("%d...", count);
fflush(stdout);
}
}
printf("done.\n");
fprintf(ofp, "bibindex %d %d %d %s", FILE_VERSION, MAJOR_VERSION,
MINOR_VERSION, ctime(&now));
printf("Writing file offsets...");
fflush(stdout);
fwrite((void *) &count, sizeof(Index_t), 1, ofp);
fwrite((void *) offsets, sizeof(long), count, ofp);
free(offsets);
printf("[%d entries]\n", count);
OutputTables(ofp);
printf("All done!\n");
}
/* ----------------------------------------------------------------- *\
| The main program
\* ----------------------------------------------------------------- */
int main(int argc, char **argv)
{
FILE *ifp;
FILE *ofp;
char infile[256];
char outfile[256];
char *p;
int i;
#if DEBUG_MALLOC
malloc_debug(2);
#endif /* DEBUG_MALLOC */
if (argc < 2)
die("Usage: bibindex bib [-i field...]", "");
if (((p = strrchr(argv[1],'.')) != (char*)NULL) &&
(strcmp(p, ".bib") == 0))
*p = '\0'; /* remove any .bib extension */
sprintf(infile, "%s.bib", argv[1]);
sprintf(outfile, "%s.bix", argv[1]);
ifp = fopen(infile, "r");
if (!ifp)
die("Can't read", infile);
ofp = fopen(outfile, "w");
if (!ofp)
die("Can't write", outfile);
InitTables();
if ((argc > 2) && (!strcmp(argv[2], "-i")))
for (i=3; i<argc; i++)
InitBlackHole(argv[i]);
IndexBibFile(ifp, ofp, argv[1]);
FreeTables();
fclose(ifp);
fclose(ofp);
exit(0); /* Argh! */
return (0); /* keep compilers happy */
}